Biological Pattern Discovery with R Machine Learning Approaches (Zheng Rong Yang)

DEG. Each such case (or control) expression will be merged to the tight cluster. Therefore the tight cluster will be adaptively updated by changing its size when scanning the case expressions. All case expressions will be scanned one by one until the point where a case expression cannot be treated as a member of the tight cluster and its membership of the tight cluster is significantly denied. In other words, the process of the algorithm is terminated when the remaining case expressions, which are treated as outliers, cannot be merged to the tight cluster.

An alternative that allows an outlier to be present among the control expressions in DOG is to use the 9^th percentile of the control expressions to form an initial tight cluster [Yang and Yang, 2013]. For a DEG, most case expressions will deviate significantly from a tight cluster formed based on the control expressions. Therefore, a process of searching for an outlier using the tight cluster approach will be terminated at an early stage.
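The scanning procedure described above can be sketched in R as follows. This is only an illustration under stated assumptions, not the DOG implementation: the function name `scan_tight_cluster`, the nearest-first scanning order, and the membership rule (an expression is accepted if it lies within `k` standard deviations of the current cluster mean) are all hypothetical stand-ins for the significance test used by the algorithm, and the expression values are made up.

```r
# Hypothetical sketch of the scanning loop; NOT the DOG implementation.
# Assumed rule: a case expression joins the tight cluster when it lies
# within k standard deviations of the current cluster mean.
scan_tight_cluster <- function(cluster, cases, sd0, k = 3) {
  # Scan case expressions from nearest to farthest (assumed ordering).
  cases <- cases[order(abs(cases - mean(cluster)))]
  s <- sd0                                   # initial spread of the cluster
  n_merged <- 0
  for (x in cases) {
    if (abs(x - mean(cluster)) > k * s) break  # membership denied: stop
    cluster <- c(cluster, x)                 # merge; the cluster grows
    n_merged <- n_merged + 1
    s <- sd(cluster)                         # update spread (assumed rule)
  }
  rest <- seq_along(cases) > n_merged        # expressions never merged
  list(cluster = cluster, outliers = cases[rest])
}

# Toy data (made-up values): five controls and five cases for one gene.
controls <- c(5.0, 5.1, 4.9, 5.2, 4.8)
cases <- c(5.05, 5.3, 8.9, 9.4, 4.95)
res <- scan_tight_cluster(controls, cases, sd0 = mad(controls))
```

In this toy run the three case expressions close to the controls are merged and the scan stops at the first expression that fails the rule, so the two remaining case expressions are returned as outliers.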

It is common for the control expressions and the case expressions of a gene to overlap. Therefore, DOG has been updated in this chapter. Rather than using the 9^th percentile of the control expressions, a new initial tight cluster is formed using a slightly different method, as follows. The following equation defines an empirical standard deviation, where the coefficient 1.4826 was used in the COPA algorithm […, et al., 2005] and $\mu$ stands for the median of all expressions,

$$\hat{\sigma} = 1.4826 \times \mathrm{median}(|\mathbf{x} - \mu|) \tag{6.16}$$
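As a minimal sketch, Eq. (6.16) can be computed in R as shown here. The coefficient 1.4826 is the usual scaling that makes the median absolute deviation a consistent estimate of the standard deviation for normally distributed data. The function name `empirical_sd` and the expression values are hypothetical and serve only as an illustration.

```r
# Empirical standard deviation from Eq. (6.16):
# sigma_hat = 1.4826 * median(|x - mu|), with mu the median of x.
empirical_sd <- function(x) {
  mu <- median(x)                  # median of all expressions
  1.4826 * median(abs(x - mu))     # scaled median absolute deviation
}

# Toy expression vector (made-up values) with one outlying expression.
x <- c(5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 9.5)
sigma_hat <- empirical_sd(x)

# Base R's mad() uses the same centre (the median) and the same
# constant (1.4826) by default, so it gives the same value.
sigma_mad <- mad(x)
```

Because the estimate is built on medians, the single outlying value has little influence on it, whereas the ordinary `sd(x)` is inflated considerably by that value.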

Figure 6.17 shows the relationship between the true (or expected) standard deviations ($\sigma$) and the estimated standard deviations ($\hat{\sigma}$) using the above equation. The correlation between the two sets of standard deviations is about 0.97, meaning that an estimated standard deviation can approximate the true or expected standard deviation very closely and can be used as an initial standard deviation to start a tight cluster in the algorithm.
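A comparison of this kind can be imitated with a small simulation. The sketch below is not the author's experiment: the grid of true standard deviations, the sample size of 200, the normal distribution, and the random seed are all assumptions chosen for illustration, so the resulting correlation need not match the value quoted above.

```r
set.seed(1)                                  # reproducible illustration
true_sd <- seq(0.5, 5, by = 0.5)             # assumed grid of true sigmas

# For each true sigma, simulate expressions and apply the estimator.
est_sd <- sapply(true_sd, function(s) {
  x <- rnorm(200, mean = 0, sd = s)          # simulated expressions
  1.4826 * median(abs(x - median(x)))        # empirical standard deviation
})

# Correlation between the true and the estimated standard deviations.
r <- cor(true_sd, est_sd)
```

Plotting `est_sd` against `true_sd` gives a scatter of the same type as the relationship described in the text.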